Assignment instructions:

Assignment submission (Brooke Grazda): ______________________________________


library(tidyverse)
library(here)
library(janitor)
library(estimatr)  
library(performance)
library(jtools)
library(ggridges)
library(gt)
library(gtsummary)
library(MASS) ## NOTE: The `select()` function is masked. Use: `dplyr::select()` ##
library(interactions) 
library(ggplot2)

DATA SOURCE:

Reed D. 2019. SBC LTER: Reef: Abundance, size and fishing effort for California Spiny Lobster (Panulirus interruptus), ongoing since 2012. Environmental Data Initiative. https://doi.org/10.6073/pasta/a593a675d644fdefb736750b291579a0. Dataset accessed 11/17/2019.


Introduction

You’re about to dive into some deep data collected from five reef sites in Santa Barbara County, all about the abundance of California spiny lobsters! 🦞 Data was gathered by divers annually from 2012 to 2018 across Naples, Mohawk, Isla Vista, Carpinteria, and Arroyo Quemado reefs.

Why lobsters? Well, this sample provides an opportunity to evaluate the impact of Marine Protected Areas (MPAs) established on January 1, 2012 (Reed, 2019). Of these five reefs, Naples, and Isla Vista are MPAs, while the other three are not protected (non-MPAs). Comparing lobster health between these protected and non-protected areas gives us the chance to study how commercial and recreational fishing might impact these ecosystems.

We will consider the MPA sites the treatment group and use regression methods to explore whether protecting these reefs really makes a difference compared to non-MPA sites (our control group). In this assignment, we’ll think deeply about which causal inference assumptions hold up under the research design and identify where they fall short.

Let’s break it down step by step and see what the data reveals! 📊


Step 1: Anticipating potential sources of selection bias

a. Do the control sites (Arroyo Quemado, Carpenteria, and Mohawk) provide a strong counterfactual for our treatment sites (Naples, Isla Vista)? Write a paragraph making a case for why this comparison is centris paribus or whether selection bias is likely (be specific!).

The control sites may provide strong counterfactual for our treatment sites in that the MPA areas may have spillover effects present, and the comparison may be skewed. The comparison may not be exactly centris paribus, because there are natural limits to the different areas. However, selection bias may be likely due to the spillover effect where the samples may move around across different sites. The lobsters sampled outside of an MPA versus a non-MPA raises the issue of how these lobsters are constrained to a given area, which they are not. Studying lobster health through lobster samples may not be a great indicator of commercial and recreational fishing between protected and non protected areas due to this selection bias.


Step 2: Read & wrangle data

a. Read in the raw data. Name the data.frame (df) rawdata

b. Use the function clean_names() from the janitor package

# HINT: check for coding of missing values (`na = "-99999"`)

rawdata <- read_csv(here::here('data', 'spiny_abundance_sb_18.csv'),  na = "-99999") |> 
    clean_names()

c. Create a new df named tidyata. Using the variable site (reef location) create a new variable reef as a factor and add the following labels in the order listed (i.e., re-order the levels):

"Arroyo Quemado", "Carpenteria", "Mohawk", "Isla Vista",  "Naples"
tidydata <- rawdata |> 
    mutate(reef = factor(site, 
                         levels = c("AQUE", "CARP", "MOHK", "IVEE", "NAPL"),
                         labels = c("Arroyo Quemado", "Carpenteria", "Mohawk", "Isla Vista",  "Naples")))

Create new df named spiny_counts

d. Create a new variable counts to allow for an analysis of lobster counts where the unit-level of observation is the total number of observed lobsters per site, year and transect.

e. Create a new variable mpa with levels MPA and non_MPA. For our regression analysis create a numerical variable treat where MPA sites are coded 1 and non_MPA sites are coded 0

#HINT(d): Use `group_by()` & `summarize()` to provide the total number of lobsters observed at each site-year-transect row-observation. 

#HINT(e): Use `case_when()` to create the 3 new variable columns

spiny_counts <- tidydata %>%
    group_by(site, year, transect) %>%              
    summarise(counts = sum(count, na.rm = TRUE), mean_size = mean(size_mm, na.rm = TRUE)) |> 
    mutate(mpa = case_when(site %in% c('IVEE', 'NAPL') ~ 'MPA',
                           .default = "non_MPA")) |>
    mutate(treat = case_when(mpa == "MPA" ~ 1,
                             .default = 0)) |> 
    ungroup()

NOTE: This step is crucial to the analysis. Check with a friend or come to TA/instructor office hours to make sure the counts are coded correctly!


Step 3: Explore & visualize data

a. Take a look at the data! Get familiar with the data in each df format (tidydata, spiny_counts)

b. We will focus on the variables count, year, site, and treat(mpa) to model lobster abundance. Create the following 4 plots using a different method each time from the 6 options provided. Add a layer (geom) to each of the plots including informative descriptive statistics (you choose; e.g., mean, median, SD, quartiles, range). Make sure each plot dimension is clearly labeled (e.g., axes, groups).

Create plots displaying the distribution of lobster counts:

  1. grouped by reef site
  2. grouped by MPA status
  3. grouped by year

Create a plot of lobster size :

  1. You choose the grouping variable(s)!
# plot 1: Lobster counts grouped by reef site

spiny_counts |> 
ggplot(aes(x = site, y = counts, colour = site)) +
    geom_boxplot() +
    theme_minimal() +
    labs(x = 'Reef Site',
         y= 'Count of Lobsters') +
    theme(legend.position = 'none')

# Plot 2: Lobster counts grouped by MPA status
spiny_counts |> 
    ggplot(aes(x = counts, y = factor(mpa),
                      fill = after_stat(density))) +
  geom_density_ridges_gradient(scale = 1.2) +
    scale_fill_viridis_c() +
    theme_minimal() +
    labs(y = "MPA Status", x = "Lobster Counts", fill = "Density",
         title = "Lobster Counts Grouped by MPA Status")

# Plot 3: Lobster counts grouped by year
spiny_counts |> 
    ggplot(aes(x = counts, fill = factor(year), color = factor(year))) +
    geom_density() +
    theme_minimal() 

# Plot 4: Lobster size grouped by reef
tidydata |> 
    ggplot(aes(x = size_mm, fill = factor(site), color = factor(site))) +
    geom_histogram(position = "stack")

c. Compare means of the outcome by treatment group. Using the tbl_summary() function from the package gt_summary

# USE: gt_summary::tbl_summary()
spiny_counts %>%
    dplyr::select(counts, mean_size, mpa) |> 
    tbl_summary(by = mpa, statistic = list(all_continuous() ~ "{mean}")) |> 
    modify_header(label ~ "**Variable**")
Variable MPA
N = 119
1
non_MPA
N = 133
1
counts 28 23
mean_size 76 73
    Unknown 12 15
1 Mean

Step 4: OLS regression- building intuition

a. Start with a simple OLS estimator of lobster counts regressed on treatment. Use the function summ() from the jtools package to print the OLS output

b. Interpret the intercept & predictor coefficients in your own words. Use full sentences and write your interpretation of the regression results to be as clear as possible to a non-academic audience.

# NOTE: We will not evaluate/interpret model fit in this assignment (e.g., R-square)

m1_ols <- lm(data = spiny_counts,
             counts ~ site)

summ(m1_ols, model.fit = FALSE) 
Observations 252
Dependent variable counts
Type OLS linear regression
Est. S.E. t val. p
(Intercept) 8.61 5.63 1.53 0.13
siteCARP 17.09 7.51 2.27 0.02
siteIVEE 32.53 7.71 4.22 0.00
siteMOHK 38.15 10.29 3.71 0.00
siteNAPL 7.88 7.51 1.05 0.30
Standard errors: OLS

c. Check the model assumptions using the check_model function from the performance package

d. Explain the results of the 4 diagnostic plots. Why are we getting this result?

We are getting these slightly off results because the data has outliers while the model assumes that the data follows a linear relationship, has equal spread, multicollinearity, and normal residuals. As we see in the plots, it is not quite all there with the data we are using. Perhaps this indicates that the OLS model is not the optimal model.

For the normality of residuals plot, the residuals are not randomly distributed along the line but rather follow a curved trajectory. This indicates that OLS may not be the best model.

check_model(m1_ols,  check = "qq" )

check_model(m1_ols, check = "normality")

In this plot, the residuals are not normally distributed, violating OLS.

check_model(m1_ols, check = "homogeneity")

Here, the variance is not constant, contradicting with OLS assumptions.

check_model(m1_ols, check = "pp_check")


Step 5: Fitting GLMs

a. Estimate a Poisson regression model using the glm() function

b. Interpret the predictor coefficient in your own words. Use full sentences and write your interpretation of the results to be as clear as possible to a non-academic audience.

The predictor coefficient estimates, for each site, an increase of how many counts will be predicted at each site.

c. Explain the statistical concept of dispersion and overdispersion in the context of this model.

The concept of dispersion of this model refers to the spread of the data, with Poisson models always assuming that the lobster count is equally dispersed among each site, or treatment group. The overdispersion in this model is related to the counts of the lobsters not being equally distributed, with the highest counts being in IVEE and MOHK treatment sites.

d. Compare results with previous model, explain change in the significance of the treatment effect

The treatment effect with the Poisson regression shows the percent change for a one unit increase in the predictor for the treatment sites. Therefore the treatment effect now has a predicted percent change for estimating lobster counts.

#HINT1: Incidence Ratio Rate (IRR): Exponentiation of beta returns coefficient which is interpreted as the 'percent change' for a one unit increase in the predictor 

#HINT2: For the second glm() argument `family` use the following specification option `family = poisson(link = "log")`

m2_pois <- glm(counts ~ treat,
               data = spiny_counts, 
               family = poisson(link ='log'))

summ(m2_pois, model.fit = FALSE) 
Observations 252
Dependent variable counts
Type Generalized linear model
Family poisson
Link log
Est. S.E. z val. p
(Intercept) 3.12 0.02 171.74 0.00
treat 0.21 0.03 8.44 0.00
Standard errors: MLE

e. Check the model assumptions. Explain results.

The model assumes that the observations are independent, the probability of counts at the treatment site are only dependent on the incidence ratio rate, the mean equals its variance, and the model is linear. In our results, there is a .65-1.56 percent change range for the probability of lobster incidences at a given site. This means that at a site like Naples, the expected count decreases compared to the other sites increased probability that have a coefficient greater than 1.

f. Conduct tests for over-dispersion & zero-inflation. Explain results.

There is a significant level of overdispersion in the Poisson regression model, likely due to the variance exceeding the mean. The model is underfitting the zero counts, suggesting the zero inflation that is detected. This means that the dataset has more zeros than the model is expecting, likely due to many observations where no lobsters were counted.

check_model(m2_pois)

check_overdispersion(m2_pois)
## # Overdispersion test
## 
##        dispersion ratio =    67.033
##   Pearson's Chi-Squared = 16758.289
##                 p-value =   < 0.001
check_zeroinflation(m2_pois)
## # Check for zero-inflation
## 
##    Observed zeros: 27
##   Predicted zeros: 0
##             Ratio: 0.00

g. Fit a negative binomial model using the function glm.nb() from the package MASS and check model diagnostics

h. In 1-2 sentences explain rationale for fitting this GLM model.

Negative binomial models are good for when our data is overdispersed, which we checked in the last step that the variance of lobster counts highly exceeds the mean.

i. Interpret the treatment estimate result in your own words. Compare with results from the previous model.

The treatment estimate for each site shows the log count change indicated by the coefficient. The results performed much better than the previous model, with no overdispersion, or zero inflation. The model predicted intervals included the data points, and the model was overall much more accurate.

# NOTE: The `glm.nb()` function does not require a `family` argument

m3_nb <- glm.nb(data = spiny_counts,
                counts ~ treat)

summ(m3_nb, model.fit = FALSE) 
Observations 252
Dependent variable counts
Type Generalized linear model
Family Negative Binomial(0.55)
Link log
Est. S.E. z val. p
(Intercept) 3.12 0.12 26.40 0.00
treat 0.21 0.17 1.23 0.22
Standard errors: MLE
check_overdispersion(m3_nb)
## # Overdispersion test
## 
##  dispersion ratio = 1.398
##           p-value = 0.088
check_zeroinflation(m3_nb)
## # Check for zero-inflation
## 
##    Observed zeros: 27
##   Predicted zeros: 30
##             Ratio: 1.12
check_predictions(m3_nb)

check_model(m3_nb)


Step 6: Compare models

a. Use the export_summ() function from the jtools package to look at the three regression models you fit side-by-side.

c. Write a short paragraph comparing the results. Is the treatment effect robust or stable across the model specifications.

The Poisson and negative binomial models differ form the OLS model because the OLS model coefficients are much larger. This is due to the OLS coefficients model absolute difference in counts, while the Poisson and Negative binomial models are on the log scale and interpreted as incidence rate ratios when exponentiated. With that being said, the poisson and negative binomial models are more consistent, with comparable standard errors. The fact that the negative binomial accounted for the overdispersion present int he poisson model makes this the most robust and reliable.

export_summs(m1_ols, m2_pois, m3_nb,
             model.names = c("OLS","Poisson", "NB"),
             statistics = "none")
OLSPoissonNB
(Intercept)8.61    3.12 ***3.12 ***
(5.63)   (0.02)   (0.12)   
siteCARP17.09 *                
(7.51)                 
siteIVEE32.53 ***              
(7.71)                 
siteMOHK38.15 ***              
(10.29)                 
siteNAPL7.88                  
(7.51)                 
treat       0.21 ***0.21    
       (0.03)   (0.17)   
*** p < 0.001; ** p < 0.01; * p < 0.05.

Step 7: Building intuition - fixed effects

a. Create new df with the year variable converted to a factor

b. Run the following OLS model using lm()

c. Take a look at the regression output. Each coefficient provides a comparison or the difference in means for a specific sub-group in the data. Informally, describe the what the model has estimated at a conceptual level (NOTE: you do not have to interpret coefficients individually)

d. Explain why the main effect for treatment is negative? *Does this result make sense?

ff_counts <- spiny_counts %>% 
    mutate(year=as_factor(year))
    
m5_fixedeffs <- lm(
    log(counts+1) ~ treat*year,
    data = ff_counts)

summ(m5_fixedeffs, model.fit = FALSE)
Observations 252
Dependent variable log(counts + 1)
Type OLS linear regression
Est. S.E. t val. p
(Intercept) 1.95 0.27 7.26 0.00
treat -1.23 0.39 -3.16 0.00
year2013 -0.27 0.38 -0.71 0.48
year2014 0.02 0.38 0.04 0.97
year2015 0.49 0.38 1.30 0.20
year2016 0.61 0.38 1.61 0.11
year2017 1.04 0.38 2.73 0.01
year2018 0.83 0.38 2.18 0.03
treat:year2013 1.16 0.55 2.10 0.04
treat:year2014 1.85 0.55 3.35 0.00
treat:year2015 2.25 0.55 4.08 0.00
treat:year2016 0.95 0.55 1.71 0.09
treat:year2017 1.22 0.55 2.20 0.03
treat:year2018 2.27 0.55 4.12 0.00
Standard errors: OLS

e. Look at the model predictions: Use the interact_plot() function from package interactions to plot mean predictions by year and treatment status.

f. Re-evaluate your responses (c) and (b) above.

# Hint 1: Group counts by `year` and `mpa` and calculate the `mean_count`
# Hint 2: Convert variable `year` to a factor

interact_plot(m5_fixedeffs, pred = year, modx = treat,
              outcome.scale = "response")

g. Using ggplot() create a plot in same style as the previous interaction plot, but displaying the original scale of the outcome variable (lobster counts). This type of plot is commonly used to show how the treatment effect changes across discrete time points (i.e., panel data).

The plot should have… - year on the x-axis - counts on the y-axis - mpa as the grouping variable

# Hint 1: Group counts by `year` and `mpa` and calculate the `mean_count`
# Hint 2: Convert variable `year` to a factor

plot_counts <- 

# plot_counts %>% ggplot() ...

Step 8: Reconsider causal identification assumptions

  1. Discuss whether you think spillover effects are likely in this research context (see Glossary of terms; https://docs.google.com/document/d/1RIudsVcYhWGpqC-Uftk9UTz3PIq6stVyEpT44EPNgpE/edit?usp=sharing)

  2. Explain why spillover is an issue for the identification of causal effects

  3. How does spillover relate to impact in this research setting?

  4. Discuss the following causal inference assumptions in the context of the MPA treatment effect estimator. Evaluate if each of the assumption are reasonable:

    1. SUTVA: Stable Unit Treatment Value assumption
    2. Excludability assumption

EXTRA CREDIT

Use the recent lobster abundance data with observations collected up until 2024 (lobster_sbchannel_24.csv) to run an analysis evaluating the effect of MPA status on lobster counts using the same focal variables.

  1. Create a new script for the analysis on the updated data
  2. Run at least 3 regression models & assess model diagnostics
  3. Compare and contrast results with the analysis from the 2012-2018 data sample (~ 2 paragraphs)